Hybrid Parallelization Strategies for Large-Scale Machine Learning in SystemML

نویسندگان

  • Matthias Boehm
  • Shirish Tatikonda
  • Berthold Reinwald
  • Prithviraj Sen
  • Yuanyuan Tian
  • Douglas Burdick
  • Shivakumar Vaithyanathan
چکیده

SystemML aims at declarative, large-scale machine learning (ML) on top of MapReduce, where high-level ML scripts with R-like syntax are compiled to programs of MR jobs. The declarative specification of ML algorithms enables—in contrast to existing large-scale machine learning libraries— automatic optimization. SystemML’s primary focus is on data parallelism but many ML algorithms inherently exhibit opportunities for task parallelism as well. A major challenge is how to efficiently combine both types of parallelism for arbitrary ML scripts and workloads. In this paper, we present a systematic approach for combining task and data parallelism for large-scale machine learning on top of MapReduce. We employ a generic Parallel FOR construct (ParFOR) as known from high performance computing (HPC). Our core contributions are (1) complementary parallelization strategies for exploiting multi-core and cluster parallelism, as well as (2) a novel cost-based optimization framework for automatically creating optimal parallel execution plans. Experiments on a variety of use cases showed that this achieves both efficiency and scalability due to automatic adaptation to ad-hoc workloads and unknown data characteristics.

برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

منابع مشابه

On Optimizing Operator Fusion Plans for Large-Scale Machine Learning in SystemML

Many large-scale machine learning (ML) systems allow specifying custom ML algorithms by means of linear algebra programs, and then automatically generate efficient execution plans. In this context, optimization opportunities for fused operators—in terms of fused chains of basic operators—are ubiquitous. These opportunities include (1) fewer materialized intermediates, (2) fewer scans of input d...

متن کامل

SystemML's Optimizer: Plan Generation for Large-Scale Machine Learning Programs

SystemML enables declarative, large-scale machine learning (ML) via a high-level language with R-like syntax. Data scientists use this language to express their ML algorithms with full flexibility but without the need to hand-tune distributed runtime execution plans and system configurations. These ML programs are dynamically compiled and optimized based on data and cluster characteristics usin...

متن کامل

SystemML: Declarative Machine Learning on Spark

The rising need for custom machine learning (ML) algorithms and the growing data sizes that require the exploitation of distributed, data-parallel frameworks such as MapReduce or Spark, pose significant productivity challenges to data scientists. Apache SystemML addresses these challenges through declarative ML by (1) increasing the productivity of data scientists as they are able to express cu...

متن کامل

Costing Generated Runtime Execution Plans for Large-Scale Machine Learning Programs

Declarative large-scale machine learning (ML) aims at the specification of ML algorithms in a high-level language and automatic generation of hybrid runtime execution plans ranging from single node, in-memory computations to distributed computations on MapReduce (MR) or similar frameworks like Spark. The compilation of large-scale ML programs exhibits many opportunities for automatic optimizati...

متن کامل

Deep Learning with Apache SystemML

Deep Learning (DL) is a subfield of Machine Learning (ML) that focuses on learning hierarchical representations of data with multiple levels of abstraction using neural networks [15]. Recent advances in deep learning are made possible due to the availability of large amounts of labeled data, use of GPGPU compute, and application of new techniques (such as ReLU, batch normalization [12], dropout...

متن کامل

ذخیره در منابع من


  با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید

برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

عنوان ژورنال:
  • PVLDB

دوره 7  شماره 

صفحات  -

تاریخ انتشار 2014